%load_ext google.cloud.bigquery
from sklearn.model_selection import TimeSeriesSplit, cross_val_score
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import precision_score, accuracy_score, recall_score
import pandas as pd
import plotly.express as px
import datetime
import warnings
warnings.filterwarnings('ignore')
import matplotlib.pyplot as plt  # add to requirements
%matplotlib inline
The following dataset contains climate information from over 9,000 stations across the world. The overall goal of these subtasks will be to predict whether it will snow tomorrow, 11 years ago: if today is 2021-02-15, then the weather we want to forecast is for 2010-02-16. You are supposed to solve the tasks using BigQuery, which can be used in a Jupyter Notebook as shown in the following cell. For further information on how to use BigQuery in a Jupyter Notebook, refer to the Google docs.
The goal of this test is to assess your coding knowledge of Python, BigQuery and pandas, as well as your understanding of data science. If you get stuck on the first part, you can use the replacement data provided in the second part.
%%bigquery
SELECT
*
FROM `bigquery-public-data.samples.gsod`
LIMIT 20
Query complete after 0.01s: 100%|██████████| 1/1 [00:00<00:00, 270.67query/s] Downloading: 100%|██████████| 20/20 [00:02<00:00, 8.00rows/s]
| station_number | wban_number | year | month | day | mean_temp | num_mean_temp_samples | mean_dew_point | num_mean_dew_point_samples | mean_sealevel_pressure | ... | min_temperature | min_temperature_explicit | total_precipitation | snow_depth | fog | rain | snow | hail | thunder | tornado | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 33790 | 99999 | 1930 | 7 | 19 | 59.000000 | 4 | 55.299999 | 4.0 | NaN | ... | NaN | None | NaN | NaN | False | False | False | False | False | False |
| 1 | 30910 | 99999 | 1931 | 2 | 12 | 38.000000 | 4 | NaN | NaN | NaN | ... | NaN | None | NaN | NaN | False | False | False | False | False | False |
| 2 | 726810 | 24131 | 1931 | 7 | 9 | 83.900002 | 24 | 40.900002 | 6.0 | NaN | ... | NaN | None | 0.0 | NaN | True | True | True | True | True | True |
| 3 | 726810 | 24131 | 1931 | 7 | 20 | 86.300003 | 24 | 30.900000 | 8.0 | NaN | ... | NaN | None | 0.0 | NaN | False | False | False | False | False | False |
| 4 | 726810 | 24131 | 1931 | 8 | 30 | 74.300003 | 24 | 38.500000 | 8.0 | NaN | ... | NaN | None | 0.0 | NaN | False | False | False | False | False | False |
| 5 | 726810 | 24131 | 1931 | 12 | 4 | 31.400000 | 24 | 29.700001 | 8.0 | NaN | ... | NaN | None | NaN | NaN | True | True | True | True | True | True |
| 6 | 726810 | 24131 | 1931 | 1 | 5 | 34.900002 | 24 | 23.400000 | 7.0 | NaN | ... | NaN | None | NaN | NaN | False | False | False | False | False | False |
| 7 | 726810 | 24131 | 1931 | 5 | 20 | 52.900002 | 24 | 25.400000 | 8.0 | NaN | ... | NaN | None | 0.0 | NaN | False | False | False | False | False | False |
| 8 | 726810 | 24131 | 1931 | 8 | 8 | 69.900002 | 24 | 23.799999 | 8.0 | NaN | ... | NaN | None | 0.0 | NaN | False | False | False | False | False | False |
| 9 | 33960 | 99999 | 1932 | 5 | 31 | 48.200001 | 4 | 47.299999 | 4.0 | NaN | ... | NaN | None | 0.0 | NaN | True | True | True | True | True | True |
| 10 | 726810 | 24131 | 1932 | 11 | 19 | 38.400002 | 24 | 37.599998 | 8.0 | NaN | ... | NaN | None | NaN | NaN | True | True | True | True | True | True |
| 11 | 726815 | 24106 | 1932 | 6 | 21 | 71.099998 | 24 | NaN | NaN | NaN | ... | NaN | None | 0.0 | NaN | False | False | False | False | False | False |
| 12 | 726810 | 24131 | 1932 | 6 | 16 | 60.700001 | 24 | 46.400002 | 8.0 | NaN | ... | NaN | None | NaN | NaN | False | False | False | False | False | False |
| 13 | 726815 | 24106 | 1932 | 5 | 11 | 63.200001 | 24 | NaN | NaN | NaN | ... | NaN | None | 0.0 | NaN | False | False | False | False | False | False |
| 14 | 726815 | 24106 | 1932 | 7 | 4 | 63.599998 | 24 | NaN | NaN | NaN | ... | NaN | None | 0.0 | NaN | False | False | False | False | False | False |
| 15 | 370310 | 99999 | 1933 | 9 | 12 | 58.000000 | 4 | NaN | NaN | NaN | ... | NaN | None | 0.0 | NaN | False | False | False | False | False | False |
| 16 | 307580 | 99999 | 1933 | 12 | 20 | -13.000000 | 4 | NaN | NaN | NaN | ... | NaN | None | 0.0 | NaN | False | False | False | False | False | False |
| 17 | 296340 | 99999 | 1933 | 10 | 25 | 19.000000 | 4 | NaN | NaN | NaN | ... | NaN | None | NaN | NaN | False | False | False | False | False | False |
| 18 | 370500 | 99999 | 1933 | 2 | 6 | 23.700001 | 4 | NaN | NaN | NaN | ... | NaN | None | 0.0 | NaN | True | True | True | True | True | True |
| 19 | 239140 | 99999 | 1933 | 11 | 20 | 25.500000 | 4 | NaN | NaN | NaN | ... | NaN | None | 0.0 | NaN | False | False | False | False | False | False |
20 rows × 31 columns
Change the date format to 'YYYY-MM-DD', select the data from 2006 through 2010 for station numbers between 725300 and 726300 (inclusive), and save it as a pandas DataFrame. Note that the maximum year available is 2010.
%%bigquery df1
SELECT * FROM (SELECT *,CAST(CONCAT(year,'-',month,'-',day) AS DATETIME) AS date
FROM `bigquery-public-data.samples.gsod`)
WHERE date BETWEEN '2006-01-01' AND '2010-12-31'
AND station_number BETWEEN 725300 AND 726300
Query complete after 0.00s: 100%|██████████| 1/1 [00:00<00:00, 211.61query/s] Downloading: 100%|██████████| 323801/323801 [00:07<00:00, 41652.55rows/s]
#remove columns 'year','month','day'
df1 = df1.drop(['year','month','day'],axis=1)
df1.head()
| station_number | wban_number | mean_temp | num_mean_temp_samples | mean_dew_point | num_mean_dew_point_samples | mean_sealevel_pressure | num_mean_sealevel_pressure_samples | mean_station_pressure | num_mean_station_pressure_samples | ... | min_temperature_explicit | total_precipitation | snow_depth | fog | rain | snow | hail | thunder | tornado | date | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 725835 | 99999 | 67.099998 | 4 | 25.200001 | 4.0 | NaN | NaN | NaN | NaN | ... | None | 0.00 | NaN | False | False | False | False | False | False | 2006-09-18 |
| 1 | 725869 | 99999 | 66.699997 | 4 | 51.799999 | 4.0 | NaN | NaN | NaN | NaN | ... | None | 0.11 | NaN | False | False | False | False | False | False | 2006-07-03 |
| 2 | 725868 | 99999 | 46.400002 | 6 | 38.000000 | 6.0 | NaN | NaN | NaN | NaN | ... | None | 0.00 | NaN | False | False | False | False | False | False | 2006-11-21 |
| 3 | 725869 | 99999 | 48.700001 | 7 | 36.599998 | 7.0 | NaN | NaN | NaN | NaN | ... | None | 0.00 | NaN | False | False | False | False | False | False | 2006-04-08 |
| 4 | 725868 | 99999 | 30.200001 | 8 | 30.000000 | 8.0 | NaN | NaN | NaN | NaN | ... | None | 0.25 | 28.0 | True | True | True | True | True | True | 2006-01-30 |
5 rows × 29 columns
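As a cross-check on the SQL date construction, the same column can be built on the pandas side; a minimal sketch on toy data, assuming year/month/day columns shaped like those above:

```python
import pandas as pd

# Toy frame mimicking the year/month/day columns of the gsod table
df = pd.DataFrame({"year": [2006, 2010], "month": [1, 12], "day": [15, 31]})

# pd.to_datetime accepts a frame whose columns are named year/month/day
df["date"] = pd.to_datetime(df[["year", "month", "day"]])

print(df["date"].dt.strftime("%Y-%m-%d").tolist())  # ['2006-01-15', '2010-12-31']
```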
df1.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 323801 entries, 0 to 323800 Data columns (total 29 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 station_number 323801 non-null int64 1 wban_number 323801 non-null int64 2 mean_temp 323801 non-null float64 3 num_mean_temp_samples 323801 non-null int64 4 mean_dew_point 323667 non-null float64 5 num_mean_dew_point_samples 323667 non-null float64 6 mean_sealevel_pressure 198979 non-null float64 7 num_mean_sealevel_pressure_samples 198979 non-null float64 8 mean_station_pressure 5365 non-null float64 9 num_mean_station_pressure_samples 5365 non-null float64 10 mean_visibility 315738 non-null float64 11 num_mean_visibility_samples 315738 non-null float64 12 mean_wind_speed 323443 non-null float64 13 num_mean_wind_speed_samples 323443 non-null float64 14 max_sustained_wind_speed 322866 non-null float64 15 max_gust_wind_speed 214240 non-null float64 16 max_temperature 323721 non-null float64 17 max_temperature_explicit 323721 non-null object 18 min_temperature 0 non-null float64 19 min_temperature_explicit 0 non-null object 20 total_precipitation 279011 non-null float64 21 snow_depth 6015 non-null float64 22 fog 323801 non-null bool 23 rain 323801 non-null bool 24 snow 323801 non-null bool 25 hail 323801 non-null bool 26 thunder 323801 non-null bool 27 tornado 323801 non-null bool 28 date 323801 non-null datetime64[ns] dtypes: bool(6), datetime64[ns](1), float64(17), int64(3), object(2) memory usage: 58.7+ MB
#check min/start date
print('The earliest date in df {}'.format(df1.date.min()))
print('The latest date in df {}'.format(df1.date.max()))
The earliest date in df 2006-01-01 00:00:00 The latest date in df 2010-04-16 00:00:00
Comments: this tests simple logical conditions, knowledge of SQL syntax, and the ability to find in the docs how to store a query result as a DataFrame variable.
From here on we want to work with the data from all stations that have information from 2005 through 2010. Select the relevant data.
%%bigquery df
SELECT * FROM (SELECT *, CAST(CONCAT(year,'-',month,'-',day) AS DATETIME) AS date
FROM `bigquery-public-data.samples.gsod`)
WHERE date BETWEEN '2005-01-01' AND '2010-12-31'
AND station_number BETWEEN 725200 AND 726400
Query complete after 0.01s: 100%|██████████| 1/1 [00:00<00:00, 191.17query/s] Downloading: 100%|██████████| 488117/488117 [00:11<00:00, 43080.25rows/s]
Because the dataset without any restriction has more than 17,000,000 rows, I decided to limit the data to station numbers between 725200 and 726400.
#save the data to pickle
#df.to_pickle('weather.pickle')
# plot only for one station
def effect_on_weather_change_in_time_on_snow(df, column_name, station_number):
df_one_station = df[df['station_number'] == station_number]
f = px.scatter(df_one_station, x = df_one_station.index, y = column_name, color = 'snow',title = column_name)
return f
# load data, drop column set index and print info about data frame
df = pd.read_pickle('weather.pickle')
df.set_index('date',inplace=True)
df.info()
<class 'pandas.core.frame.DataFrame'> DatetimeIndex: 488117 entries, 2005-11-16 to 2010-04-09 Data columns (total 31 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 station_number 488117 non-null int64 1 wban_number 488117 non-null int64 2 year 488117 non-null int64 3 month 488117 non-null int64 4 day 488117 non-null int64 5 mean_temp 488117 non-null float64 6 num_mean_temp_samples 488117 non-null int64 7 mean_dew_point 487820 non-null float64 8 num_mean_dew_point_samples 487820 non-null float64 9 mean_sealevel_pressure 304905 non-null float64 10 num_mean_sealevel_pressure_samples 304905 non-null float64 11 mean_station_pressure 15572 non-null float64 12 num_mean_station_pressure_samples 15572 non-null float64 13 mean_visibility 477829 non-null float64 14 num_mean_visibility_samples 477829 non-null float64 15 mean_wind_speed 487499 non-null float64 16 num_mean_wind_speed_samples 487499 non-null float64 17 max_sustained_wind_speed 486655 non-null float64 18 max_gust_wind_speed 314736 non-null float64 19 max_temperature 487983 non-null float64 20 max_temperature_explicit 487983 non-null object 21 min_temperature 0 non-null float64 22 min_temperature_explicit 0 non-null object 23 total_precipitation 430854 non-null float64 24 snow_depth 13533 non-null float64 25 fog 488117 non-null bool 26 rain 488117 non-null bool 27 snow 488117 non-null bool 28 hail 488117 non-null bool 29 thunder 488117 non-null bool 30 tornado 488117 non-null bool dtypes: bool(6), float64(17), int64(6), object(2) memory usage: 99.6+ MB
The DataFrame has 31 columns and 488,117 rows. The dtypes are bool, datetime64[ns], float64, int64 and object. There are also missing values in the DataFrame.
As a first step, I check the fraction of missing values per column to decide which columns to drop:
# percentage of NaN values per column
df.isna().sum()/len(df)
station_number 0.000000 wban_number 0.000000 year 0.000000 month 0.000000 day 0.000000 mean_temp 0.000000 num_mean_temp_samples 0.000000 mean_dew_point 0.000608 num_mean_dew_point_samples 0.000608 mean_sealevel_pressure 0.375344 num_mean_sealevel_pressure_samples 0.375344 mean_station_pressure 0.968098 num_mean_station_pressure_samples 0.968098 mean_visibility 0.021077 num_mean_visibility_samples 0.021077 mean_wind_speed 0.001266 num_mean_wind_speed_samples 0.001266 max_sustained_wind_speed 0.002995 max_gust_wind_speed 0.355204 max_temperature 0.000275 max_temperature_explicit 0.000275 min_temperature 1.000000 min_temperature_explicit 1.000000 total_precipitation 0.117314 snow_depth 0.972275 fog 0.000000 rain 0.000000 snow 0.000000 hail 0.000000 thunder 0.000000 tornado 0.000000 dtype: float64
# percentage of NaN values in 'snow_depth' column
df['snow_depth'].isna().sum()/len(df['snow_depth'])
0.9722750897838018
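Since isna() returns booleans, the same per-column NaN fraction falls out of isna().mean() in one step; a small sketch on toy data:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": [1.0, np.nan, 3.0, np.nan], "b": [1.0, 2.0, 3.0, 4.0]})

# mean() over booleans is the fraction of True values,
# equivalent to isna().sum() / len(df)
frac = df.isna().mean()
print(frac["a"], frac["b"])  # 0.5 0.0
```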
# shape of data frame before cleaning
df.shape
(488117, 31)
#head of data frame before cleaning
df.head()
| station_number | wban_number | year | month | day | mean_temp | num_mean_temp_samples | mean_dew_point | num_mean_dew_point_samples | mean_sealevel_pressure | ... | min_temperature | min_temperature_explicit | total_precipitation | snow_depth | fog | rain | snow | hail | thunder | tornado | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| date | |||||||||||||||||||||
| 2005-11-16 | 726395 | 14808 | 2005 | 11 | 16 | 36.000000 | 4 | 31.100000 | 4.0 | NaN | ... | NaN | None | 0.0 | NaN | False | False | False | False | False | False |
| 2005-08-02 | 725496 | 99999 | 2005 | 8 | 2 | 86.900002 | 4 | 72.099998 | 4.0 | NaN | ... | NaN | None | 0.0 | NaN | False | False | False | False | False | False |
| 2005-03-30 | 725488 | 99999 | 2005 | 3 | 30 | 52.200001 | 4 | NaN | NaN | NaN | ... | NaN | None | 0.0 | NaN | False | False | False | False | False | False |
| 2005-09-27 | 725868 | 99999 | 2005 | 9 | 27 | 51.799999 | 5 | 36.700001 | 5.0 | NaN | ... | NaN | None | 0.0 | NaN | False | False | False | False | False | False |
| 2005-03-19 | 725835 | 99999 | 2005 | 3 | 19 | 46.700001 | 6 | 33.400002 | 5.0 | NaN | ... | NaN | None | NaN | NaN | False | False | False | False | False | False |
5 rows × 31 columns
# convert temperatures from Fahrenheit to Celsius
df['mean_temp'] = (df['mean_temp']-32) * 5/9
df['max_temperature'] = (df['max_temperature']-32) * 5/9
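The conversion applied above can be wrapped in a small helper and sanity-checked at the usual reference points:

```python
def fahrenheit_to_celsius(f):
    """Convert degrees Fahrenheit to degrees Celsius."""
    return (f - 32) * 5 / 9

print(fahrenheit_to_celsius(32))   # 0.0, freezing point of water
print(fahrenheit_to_celsius(212))  # 100.0, boiling point of water
```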
# explore target for all station together
px.histogram(df, x='snow', color="snow", title = 'Frequence distribution of "Snow" variable')
The plot above shows that the dataset is imbalanced: no-snow days far outnumber snow days. This will be important when choosing metrics to evaluate the model.
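The imbalance can be quantified directly with value_counts(normalize=True); a toy sketch with a 9:1 split similar in spirit to the 'snow' target:

```python
import pandas as pd

# Toy target with the same kind of imbalance as the 'snow' column
snow = pd.Series([False] * 9 + [True])

# Fraction of each class; with 90% negatives, a model that always
# predicts "no snow" already reaches 90% accuracy
print(snow.value_counts(normalize=True).to_dict())  # {False: 0.9, True: 0.1}
```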
# here we inspect the target for every station. The plot is interactive; you can zoom in and out to see the results better.
df['station_number'] = df['station_number'].astype(str)
px.histogram(df, x='station_number', color="snow", barmode='group')
# plots for chosen columns and one station
columns_name = ['mean_temp','max_temperature','mean_dew_point']
for col in columns_name:
f = effect_on_weather_change_in_time_on_snow(df, col, '725869')
f.show()
def shift_col(df, list_of_col, shift_val):
    """
    Add lagged features for the columns in list_of_col.
    """
    for col in list_of_col:
        df[col + '_yesterday_' + str(shift_val)] = df[col].shift(shift_val)
def rolling_col(df, list_of_col, window_size):
    """
    Add rolling-average features for the columns in list_of_col.
    """
    for col in list_of_col:
        # use the window_size argument (was hard-coded to 10)
        df['avg_number_of_{}_last_{}_days'.format(col, window_size)] = df[col].rolling(window_size).mean()
def add_features(df, list_of_col, list_of_shifts, windows_list):
    """
    Add features to the data frame and prepare the target.
    """
    # will be used as target
    df['snow_tomorrow'] = df.snow.shift(-1)
    # add new features
    for shift in list_of_shifts:
        shift_col(df, list_of_col, shift)
    for window in windows_list:
        rolling_col(df, list_of_col, window)
    # drop the warm-up rows of the largest rolling window and the last row,
    # whose target (tomorrow's snow) is unknown
    if windows_list:
        df = df[max(windows_list):-1]
    else:
        df = df[:-1]
    return df
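The shift() and rolling().mean() behaviour these helpers rely on can be verified on a toy series:

```python
import pandas as pd

s = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0])

# shift(1) is "yesterday's value"; the first entry becomes NaN
lag1 = s.shift(1)
# rolling(3).mean() averages the current and the two previous values,
# producing NaN until the window is full
roll3 = s.rolling(3).mean()

print(lag1.tolist())   # [nan, 1.0, 2.0, 3.0, 4.0]
print(roll3.tolist())  # [nan, nan, 2.0, 3.0, 4.0]
```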
def plot_train_val_test_fold(X_train, X_val, i):
    """
    Plot mean_temp for one station for every split.
    """
    # .copy() avoids SettingWithCopyWarning when adding the 'part' column
    X_train_plot = X_train[X_train.station_number == '726395'].copy()
    X_train_plot['part'] = 'train'
    X_train_plot = X_train_plot[['part', 'mean_temp']]
    X_val_plot = X_val[X_val.station_number == '726395'].copy()
    X_val_plot['part'] = 'validation'
    X_val_plot = X_val_plot[['part', 'mean_temp']]
    X_plot = pd.concat([X_train_plot, X_val_plot])
    f = px.line(X_plot, x=X_plot.index, y=X_plot['mean_temp'], color='part', title='Fold {}'.format(i))
    return f
def calculate_metrics(model, pred_val, y_val, type_of_data):
    """
    Calculate accuracy, precision and recall.
    """
    pred_val = pred_val.astype(int)
    y_val = y_val.astype(int)
    result = {'model': str(model).split('(')[0],
              'accuracy_{}'.format(type_of_data): accuracy_score(y_val, pred_val),
              'precision_{}'.format(type_of_data): precision_score(y_val, pred_val),
              'recall_{}'.format(type_of_data): recall_score(y_val, pred_val)}
    return result
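A quick hand-computable check of the three metrics used here (1 = snow):

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score

y_true = [0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 1, 1, 0]  # one false positive, one false negative

acc = accuracy_score(y_true, y_pred)    # 4 correct out of 6
prec = precision_score(y_true, y_pred)  # 1 TP / (1 TP + 1 FP) = 0.5
rec = recall_score(y_true, y_pred)      # 1 TP / (1 TP + 1 FN) = 0.5
print(acc, prec, rec)
```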
Do a first analysis of the remaining dataset; clean or drop data as you see appropriate.
#remove all columns which start with 'num_'
df = df[[col for col in df.columns if not col.startswith('num_')]]
#remove columns with big percentage of NaN
columns_to_drop = ['min_temperature', 'min_temperature_explicit','snow_depth','mean_station_pressure','mean_sealevel_pressure','max_gust_wind_speed']
df = df[df.columns[~df.columns.isin(columns_to_drop)]]
# check if station_number and wban_number are unique.
df.groupby('station_number')['wban_number'].value_counts()
station_number wban_number
725200 94823 1931
725204 99999 1380
4813 484
725205 14762 1931
725207 99999 1421
...
726394 99999 1419
4874 494
726395 14808 1923
726396 99999 1636
726400 14839 1930
Name: wban_number, Length: 424, dtype: int64
We see that one station can have more than one wban_number. To simplify the analysis I will drop the wban_number column. I also decided to drop max_temperature_explicit and tornado.
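The one-to-many station-to-wban mapping can be confirmed more directly with nunique(); a sketch on toy data mirroring the output above:

```python
import pandas as pd

# Toy mapping: station 725204 reports under two different wban numbers
df = pd.DataFrame({
    "station_number": [725200, 725204, 725204, 725205],
    "wban_number": [94823, 99999, 4813, 14762],
})

# Stations linked to more than one wban_number
n_wban = df.groupby("station_number")["wban_number"].nunique()
print(n_wban[n_wban > 1].index.tolist())  # [725204]
```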
df = df.drop(['wban_number', 'max_temperature_explicit', 'tornado'], axis = 1)
df.info()
<class 'pandas.core.frame.DataFrame'> DatetimeIndex: 488117 entries, 2005-11-16 to 2010-04-09 Data columns (total 16 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 station_number 488117 non-null object 1 year 488117 non-null int64 2 month 488117 non-null int64 3 day 488117 non-null int64 4 mean_temp 488117 non-null float64 5 mean_dew_point 487820 non-null float64 6 mean_visibility 477829 non-null float64 7 mean_wind_speed 487499 non-null float64 8 max_sustained_wind_speed 486655 non-null float64 9 max_temperature 487983 non-null float64 10 total_precipitation 430854 non-null float64 11 fog 488117 non-null bool 12 rain 488117 non-null bool 13 snow 488117 non-null bool 14 hail 488117 non-null bool 15 thunder 488117 non-null bool dtypes: bool(5), float64(7), int64(3), object(1) memory usage: 47.0+ MB
We still have NaN values in five columns; I will use ffill() to fill them.
I lag the features 'snow' and 'mean_temp' under the assumption that these are important predictors.
# sort df
df = df.sort_values(['year','month','day'])
# without extra features:
#list_of_col,list_of_shifts,windows_list = [],[],[]
list_of_col = ['snow','mean_temp']
list_of_shifts = [1,2]
windows_list = [5,10]
collect = []
for station_number, group in df.groupby('station_number'):
    # forward fill propagates the last valid observation; rows whose first
    # values were NaN are dropped after ffill()
    per_station_df = group.ffill().dropna()
    per_station_df = add_features(per_station_df, list_of_col, list_of_shifts, windows_list)
    per_station_df['station_number'] = station_number
    collect.append(per_station_df)
df_final = pd.concat(collect)
df_final.info()
<class 'pandas.core.frame.DataFrame'> DatetimeIndex: 475095 entries, 2005-01-11 to 2010-04-15 Data columns (total 25 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 station_number 475095 non-null object 1 year 475095 non-null int64 2 month 475095 non-null int64 3 day 475095 non-null int64 4 mean_temp 475095 non-null float64 5 mean_dew_point 475095 non-null float64 6 mean_visibility 475095 non-null float64 7 mean_wind_speed 475095 non-null float64 8 max_sustained_wind_speed 475095 non-null float64 9 max_temperature 475095 non-null float64 10 total_precipitation 475095 non-null float64 11 fog 475095 non-null bool 12 rain 475095 non-null bool 13 snow 475095 non-null bool 14 hail 475095 non-null bool 15 thunder 475095 non-null bool 16 snow_tomorrow 475095 non-null object 17 snow_yesterday_1 475095 non-null object 18 mean_temp_yesterday_1 475095 non-null float64 19 snow_yesterday_2 475095 non-null object 20 mean_temp_yesterday_2 475095 non-null float64 21 avg_number_of_snow_last_5_days 475095 non-null float64 22 avg_number_of_mean_temp_last_5_days 475095 non-null float64 23 avg_number_of_snow_last_10_days 475095 non-null float64 24 avg_number_of_mean_temp_last_10_days 475095 non-null float64 dtypes: bool(5), float64(13), int64(3), object(4) memory usage: 78.4+ MB
The data now contains no NaNs and the new feature columns were added.
today_11_years_ago = str(datetime.datetime.today()- datetime.timedelta(days=11*365)).split(' ')[0]
tomorrow_11_years_ago = str((datetime.datetime.today()- datetime.timedelta(days=11*365)) + datetime.timedelta(days=1)).split(' ')[0]
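Note that days=11*365 ignores leap days, so it drifts from "the same calendar date 11 years ago" by a few days. A sketch of the difference, assuming python-dateutil is available (it ships as a pandas dependency):

```python
from datetime import datetime, timedelta
from dateutil.relativedelta import relativedelta

today = datetime(2021, 2, 15)

# 11 * 365 days misses the 3 leap days between 2010 and 2021
approx = today - timedelta(days=11 * 365)
# relativedelta steps back by whole calendar years
exact = today - relativedelta(years=11)

print(approx.date(), exact.date())  # 2010-02-18 2010-02-15
```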
df_final = df_final.sort_values(['year','month','day'])
# we drop the 'year' feature because we don't need it anymore
df_final = df_final.drop('year', axis=1)
#split train/test
# df_train will be used in time-series cross-validation
df_train = df_final[df_final.index < today_11_years_ago]
# df_test will be used only for final test
df_test = df_final[(df_final.index == today_11_years_ago)]
y = df_train['snow_tomorrow']
X = df_train.drop('snow_tomorrow',axis=1)
y_test = df_test['snow_tomorrow']
X_test = df_test.drop('snow_tomorrow',axis=1)
For modeling I decided to use RandomForestClassifier with TimeSeriesSplit cross-validation, which works like an expanding window along the time dimension:
for example, train on 2006 and validate on 2007, then in the next fold train on 2006-2007 and validate on 2008, and so on (see the plots below).
Accuracy, precision and recall were calculated for train, validation and test.
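The expanding-window behaviour of TimeSeriesSplit is easy to see on a toy array; with 10 samples and n_splits=3, sklearn defaults to a validation block of 10 // (3 + 1) = 2 rows:

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

X = np.arange(10).reshape(-1, 1)
splits = list(TimeSeriesSplit(n_splits=3).split(X))

# Each fold trains on an expanding prefix and validates on the block after it
for train_idx, val_idx in splits:
    print(train_idx.tolist(), val_idx.tolist())
# [0, 1, 2, 3] [4, 5]
# [0, 1, 2, 3, 4, 5] [6, 7]
# [0, 1, 2, 3, 4, 5, 6, 7] [8, 9]
```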
rfc = RandomForestClassifier(n_estimators=5,max_depth=10)
tscv = TimeSeriesSplit(n_splits = 4)
models_list = [rfc]
metrics = []
for model in models_list:
    for i, (train_index, val_index) in enumerate(tscv.split(X)):
        X_train, X_val = X.iloc[train_index], X.iloc[val_index]
        y_train, y_val = y.iloc[train_index], y.iloc[val_index]
        model.fit(X_train, y_train.astype(int))
        pred_train = model.predict(X_train)
        pred_val = model.predict(X_val)
        pred_test = model.predict(X_test)
        # baseline: predict tomorrow's snow with today's snow flag
        metrics_baseline = calculate_metrics(model, X_val.snow, y_val, 'baseline')
        metrics_train = calculate_metrics(model, pred_train, y_train, 'train')
        metrics_val = calculate_metrics(model, pred_val, y_val, 'validation')
        metrics_test = calculate_metrics(model, pred_test, y_test, 'test')
        metrics_baseline.update(metrics_train)
        metrics_baseline.update(metrics_val)
        metrics_baseline.update(metrics_test)
        metrics_baseline.update({'fold': i})
        metrics.append(metrics_baseline)
        f = plot_train_val_test_fold(X_train, X_val, i)
        f.show()
res = pd.DataFrame(metrics)
res[[col for col in res.columns if not col.startswith(('prec','rec'))]]
| model | accuracy_baseline | accuracy_train | accuracy_validation | accuracy_test | fold | |
|---|---|---|---|---|---|---|
| 0 | RandomForestClassifier | 0.894101 | 0.860759 | 0.921034 | 0.932 | 0 |
| 1 | RandomForestClassifier | 0.878605 | 0.890638 | 0.907373 | 0.940 | 1 |
| 2 | RandomForestClassifier | 0.878508 | 0.896349 | 0.910805 | 0.936 | 2 |
| 3 | RandomForestClassifier | 0.868700 | 0.900012 | 0.906337 | 0.956 | 3 |
res[[col for col in res.columns if not col.startswith('acc')]]
| model | precision_baseline | recall_baseline | precision_train | recall_train | precision_validation | recall_validation | precision_test | recall_test | fold | |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | RandomForestClassifier | 0.326335 | 0.326514 | 0.735937 | 0.408545 | 0.492078 | 0.149252 | 0.333333 | 0.307692 | 0 |
| 1 | RandomForestClassifier | 0.351382 | 0.351788 | 0.724738 | 0.292769 | 0.518550 | 0.138639 | 0.400000 | 0.307692 | 1 |
| 2 | RandomForestClassifier | 0.334358 | 0.334398 | 0.710344 | 0.233304 | 0.551206 | 0.121556 | 0.200000 | 0.076923 | 2 |
| 3 | RandomForestClassifier | 0.335486 | 0.334534 | 0.722064 | 0.187402 | 0.630836 | 0.126828 | 0.750000 | 0.230769 | 3 |
The baseline model predicts tomorrow's snow based only on whether it is snowing today.
I calculated precision and recall because the dataset is imbalanced: we have more no-snow days than snow days.
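The persistence baseline can be reproduced on a toy example: the prediction is today's snow flag, scored against tomorrow's truth (1 = snow):

```python
from sklearn.metrics import precision_score, recall_score

snow_today =    [0, 1, 1, 0, 0, 1, 0, 0]  # used as the prediction
snow_tomorrow = [1, 1, 0, 0, 0, 1, 0, 0]  # ground truth

prec = precision_score(snow_tomorrow, snow_today)  # 2 TP / (2 TP + 1 FP) ≈ 0.67
rec = recall_score(snow_tomorrow, snow_today)      # 2 TP / (2 TP + 1 FN) ≈ 0.67
print(prec, rec)
```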
# Predict snow for tomorrow.
# Here I use the whole X, y to train and predict for just one day
rfc.fit(X,y.astype(int))
pred = rfc.predict(X_test)
metrics_final = calculate_metrics(rfc,pred,y_test.values,'final prediction for test')
pd.DataFrame([metrics_final])
| model | accuracy_final prediction for test | precision_final prediction for test | recall_final prediction for test | |
|---|---|---|---|---|
| 0 | RandomForestClassifier | 0.956 | 0.666667 | 0.307692 |
Modeling per station and weather prediction for tomorrow.
rfc = RandomForestClassifier(n_estimators=5,max_depth=10)
tscv = TimeSeriesSplit(n_splits = 4)
models_list = [rfc]
metrics = []
for station, data in df_train.groupby('station_number'):
    y = data['snow_tomorrow']
    X = data.drop(['snow_tomorrow', 'station_number'], axis=1)
    df_test_station = df_test[df_test.station_number == station]
    y_test = df_test_station['snow_tomorrow']
    X_test = df_test_station.drop(['snow_tomorrow', 'station_number'], axis=1)
    # fit the per-station model (rfc, not the leftover `model` variable)
    rfc.fit(X, y.astype(int))
    try:
        pred_train = rfc.predict(X)
        pred_test = rfc.predict(X_test)
        metrics_train = calculate_metrics(rfc, pred_train, y, 'train')
        metrics_train.update({'station': station})
        metrics_train.update({'y_pred': pred_test[0]})
        metrics_train.update({'y_true': y_test.iloc[0]})
        metrics.append(metrics_train)
    except IndexError:
        # some stations have no row for the test day
        print('No data for {} in test'.format(station))
No data for 725291 in test No data for 725473 in test No data for 725476 in test No data for 725478 in test No data for 725765 in test No data for 725786 in test No data for 725827 in test No data for 725835 in test No data for 725848 in test No data for 726388 in test
a = pd.DataFrame(metrics)
a
| model | accuracy_train | precision_train | recall_train | station | y_pred | y_true | |
|---|---|---|---|---|---|---|---|
| 0 | RandomForestClassifier | 0.951415 | 0.918367 | 0.705882 | 725200 | 0 | False |
| 1 | RandomForestClassifier | 0.966722 | 0.952381 | 0.408163 | 725204 | 0 | False |
| 2 | RandomForestClassifier | 0.907101 | 0.938356 | 0.453642 | 725205 | 0 | False |
| 3 | RandomForestClassifier | 0.983440 | 1.000000 | 0.425926 | 725207 | 0 | False |
| 4 | RandomForestClassifier | 0.929266 | 0.954023 | 0.395238 | 725208 | 0 | False |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 245 | RandomForestClassifier | 0.990930 | 1.000000 | 0.739130 | 726392 | 0 | True |
| 246 | RandomForestClassifier | 0.980054 | 0.986667 | 0.672727 | 726394 | 0 | False |
| 247 | RandomForestClassifier | 0.991957 | 1.000000 | 0.732143 | 726395 | 0 | False |
| 248 | RandomForestClassifier | 0.996200 | 1.000000 | 0.666667 | 726396 | 0 | True |
| 249 | RandomForestClassifier | 0.928953 | 0.968553 | 0.546099 | 726400 | 0 | False |
250 rows × 7 columns
precision_score(a.y_true,a.y_pred)
0.24
What could lead to better results?